a: wc Shakespeare.txt: 124456 bytes, 901325 lines, 5458199 words
b: grep -c -i "by William Shakespeare" Shakespeare.txt: 38
The first part of the pipeline searches all occurences of "by William Shakespeare" in Shakespeare.txt. The Parameter -B 6 further includes the 6 lines before the selected pattern. The output of the first part is then matched again with a expression in three pieces:
The regex ^$ is denied by the option -e so all empty lines are excluded. The regex tr '\n' ' ' is used to replace all newlines with a space. The regex sed 's/ -- /\n/g' is used to replace all double dashes with a newline.
The execution involes 4 processes.
a) monotonically_increasing_id => filter b) filter(... isin copyright) c) withColumn when contains "by William Shakespeare" lag... d+e) repartition => groupBy => agg
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, col, lag, when, desc,split, explode,countDistinct,row_number, max, regexp_replace, last, lower
from pyspark.sql.window import Window
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
sc = spark.sparkContext
df1 = spark.read.text("Shakespeare.txt") \
.withColumn("rowId", monotonically_increasing_id()) \
.filter(col("rowId") >245 ) \
.filter(col("value")!="")
df1.show()
+--------------------+-----+ | value|rowId| +--------------------+-----+ | THE SONNETS| 246| |by William Shakes...| 248| | ...| 252| | From fairest cr...| 253| | That thereby be...| 254| | But as the ripe...| 255| | His tender heir...| 256| | But thou contra...| 257| | Feed'st thy lig...| 258| | Making a famine...| 259| | Thy self thy fo...| 260| | Thou that art n...| 261| | And only herald...| 262| | Within thine ow...| 263| | And tender chur...| 264| | Pity the worl...| 265| | To eat the wo...| 266| | ...| 269| | When forty wint...| 270| | And dig deep tr...| 271| +--------------------+-----+ only showing top 20 rows
copyright = '''<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF ILLINOIS BENEDICTINE COLLEGE
WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>'''
copyright = [line for line in copyright.split("\n")]
df1.filter(col("value").isin(copyright)).show()
df1=df1.filter(~col("value").isin(copyright))
df1.show()
+--------------------+-----+ | value|rowId| +--------------------+-----+ |<<THIS ELECTRONIC...| 2873| |SHAKESPEARE IS CO...| 2874| |PROVIDED BY PROJE...| 2875| |WITH PERMISSION. ...| 2876| |DISTRIBUTED SO LO...| 2877| |PERSONAL USE ONLY...| 2878| |COMMERCIALLY. PR...| 2879| |SERVICE THAT CHAR...| 2880| |<<THIS ELECTRONIC...| 2919| |SHAKESPEARE IS CO...| 2920| |PROVIDED BY PROJE...| 2921| |WITH PERMISSION. ...| 2922| |DISTRIBUTED SO LO...| 2923| |PERSONAL USE ONLY...| 2924| |COMMERCIALLY. PR...| 2925| |SERVICE THAT CHAR...| 2926| |<<THIS ELECTRONIC...| 3521| |SHAKESPEARE IS CO...| 3522| |PROVIDED BY PROJE...| 3523| |WITH PERMISSION. ...| 3524| +--------------------+-----+ only showing top 20 rows +--------------------+-----+ | value|rowId| +--------------------+-----+ | THE SONNETS| 246| |by William Shakes...| 248| | ...| 252| | From fairest cr...| 253| | That thereby be...| 254| | But as the ripe...| 255| | His tender heir...| 256| | But thou contra...| 257| | Feed'st thy lig...| 258| | Making a famine...| 259| | Thy self thy fo...| 260| | Thou that art n...| 261| | And only herald...| 262| | Within thine ow...| 263| | And tender chur...| 264| | Pity the worl...| 265| | To eat the wo...| 266| | ...| 269| | When forty wint...| 270| | And dig deep tr...| 271| +--------------------+-----+ only showing top 20 rows
df1 = df1.withColumn("title",
when(col("value").contains("by William Shakespeare"),
lag(col("value"),1).over(Window.orderBy('rowId')))) \
.withColumn("title", last("title", ignorenulls=True).over(Window.orderBy("rowId")))\
.dropna()
partitions = df1.agg(countDistinct(col("title"))).collect()[0][0]
df1=df1.repartition(partitions,"title")
df1.rdd.getNumPartitions()
df1.show()
df1.rdd.getNumPartitions()
+--------------------+-----+--------------------+ | value|rowId| title| +--------------------+-----+--------------------+ |by William Shakes...|31856|SECOND PART OF KI...| | Dramatis Personae|31860|SECOND PART OF KI...| | RUMOUR, the Pre...|31862|SECOND PART OF KI...| | KING HENRY THE ...|31863|SECOND PART OF KI...| | HENRY, PRINCE O...|31865|SECOND PART OF KI...| | PRINCE JOHN OF ...|31866|SECOND PART OF KI...| | PRINCE HUMPHREY...|31867|SECOND PART OF KI...| | THOMAS, DUKE OF...|31868|SECOND PART OF KI...| | Sons of Henry IV|31869|SECOND PART OF KI...| | EARL OF NORTHUM...|31871|SECOND PART OF KI...| | SCROOP, ARCHBIS...|31872|SECOND PART OF KI...| | LORD MOWBRAY|31873|SECOND PART OF KI...| | LORD HASTINGS|31874|SECOND PART OF KI...| | LORD BARDOLPH|31875|SECOND PART OF KI...| | SIR JOHN COLVILLE|31876|SECOND PART OF KI...| | TRAVERS and MOR...|31877|SECOND PART OF KI...| | Opposites aga...|31878|SECOND PART OF KI...| | EARL OF WARWICK|31880|SECOND PART OF KI...| | EARL OF WESTMOR...|31881|SECOND PART OF KI...| | EARL OF SURREY|31882|SECOND PART OF KI...| +--------------------+-----+--------------------+ only showing top 20 rows
38
df1 = df1.withColumn("rowId", row_number().over(Window.partitionBy("title").orderBy("rowId"))) \
.select("rowId","value","title", split("value", '\s+').alias('split')) \
.select("rowId","value","title",explode("split").alias('word')) \
.where(col("word") != '') \
.withColumn("word",regexp_replace(col("word"),"[^\w\s]", "")) \
.withColumn("wordId", row_number().over(Window.partitionBy("title").orderBy("rowId")))
df1.show()
+-----+--------------------+--------------------+-----------+------+ |rowId| value| title| word|wordId| +-----+--------------------+--------------------+-----------+------+ | 1|by William Shakes...|SECOND PART OF KI...| by| 1| | 1|by William Shakes...|SECOND PART OF KI...| William| 2| | 1|by William Shakes...|SECOND PART OF KI...|Shakespeare| 3| | 2| Dramatis Personae|SECOND PART OF KI...| Dramatis| 4| | 2| Dramatis Personae|SECOND PART OF KI...| Personae| 5| | 3| RUMOUR, the Pre...|SECOND PART OF KI...| RUMOUR| 6| | 3| RUMOUR, the Pre...|SECOND PART OF KI...| the| 7| | 3| RUMOUR, the Pre...|SECOND PART OF KI...| Presenter| 8| | 4| KING HENRY THE ...|SECOND PART OF KI...| KING| 9| | 4| KING HENRY THE ...|SECOND PART OF KI...| HENRY| 10| | 4| KING HENRY THE ...|SECOND PART OF KI...| THE| 11| | 4| KING HENRY THE ...|SECOND PART OF KI...| FOURTH| 12| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| HENRY| 13| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| PRINCE| 14| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| OF| 15| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| WALES| 16| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| afterwards| 17| | 5| HENRY, PRINCE O...|SECOND PART OF KI...| HENRY| 18| | 6| PRINCE JOHN OF ...|SECOND PART OF KI...| PRINCE| 19| | 6| PRINCE JOHN OF ...|SECOND PART OF KI...| JOHN| 20| +-----+--------------------+--------------------+-----------+------+ only showing top 20 rows
df1.groupBy("title")\
.agg(
max(col("rowId")),
max(col("wordId"))
).show()
+--------------------+----------+-----------+ | title|max(rowId)|max(wordId)| +--------------------+----------+-----------+ |SECOND PART OF KI...| 3286| 27632| |THE TRAGEDY OF OT...| 3636| 27863| |LOVE'S LABOUR'S LOST| 2832| 22888| |THE FIRST PART OF...| 3082| 25719| |THE TAMING OF THE...| 2776| 22156| | A LOVER'S COMPLAINT| 332| 2575| |THE COMEDY OF ERRORS| 1922| 16202| |THE MERCHANT OF V...| 2740| 22255| |THE TRAGEDY OF AN...| 3851| 26460| |THE SECOND PART O...| 3324| 26784| |THE TRAGEDY OF TI...| 2759| 21635| | THE SONNETS| 2319| 17747| | CYMBELINE| 3842| 28792| |KING HENRY THE EI...| 3492| 25815| | THE TEMPEST| 2442| 17419| | MEASURE FOR MEASURE| 2880| 22874| | KING RICHARD III| 4161| 31093| |MUCH ADO ABOUT NO...| 2551| 22453| |A MIDSUMMER NIGHT...| 2245| 17233| |THE TRAGEDY OF KI...| 3641| 27511| +--------------------+----------+-----------+ only showing top 20 rows
df1.groupBy("title")\
.agg(
max(col("rowId")),
) \
.orderBy(desc("max(rowId)")).show()
+--------------------+----------+ | title|max(rowId)| +--------------------+----------+ |THE TRAGEDY OF HA...| 4169| | KING RICHARD III| 4161| |THE TRAGEDY OF CO...| 3916| |THE TRAGEDY OF AN...| 3851| | CYMBELINE| 3842| |THE HISTORY OF TR...| 3643| |THE TRAGEDY OF KI...| 3641| |THE TRAGEDY OF OT...| 3636| |KING HENRY THE EI...| 3492| | THE WINTER'S TALE| 3371| |THE SECOND PART O...| 3324| |THE LIFE OF KING ...| 3320| |THE TRAGEDY OF RO...| 3287| |SECOND PART OF KI...| 3286| |THE THIRD PART OF...| 3192| |THE FIRST PART OF...| 3082| |THE FIRST PART OF...| 3062| |KING RICHARD THE ...| 2985| |ALLS WELL THAT EN...| 2958| | MEASURE FOR MEASURE| 2880| +--------------------+----------+ only showing top 20 rows
Stopwords are common words that add no meaning to a text. Therefore they are often removend during text mining. A collection of stopword-lists is avaible in this repo: https://github.com/stopwords-iso/stopwords-en/tree/master/raw
stopwords = spark.read.text("stopwords-en.txt").withColumnRenamed("value", "word")
df1 = df1.withColumn("word", lower(col("word")))
df1=df1.join(stopwords, on='word', how='left_anti')
df1.groupBy("title","word").count().sort(desc("count")).show()
+--------------------+---------+-----+ | title| word|count| +--------------------+---------+-----+ |THE TRAGEDY OF AN...| antony| 375| |TWELFTH NIGHT; OR...| sir| 366| |THE TRAGEDY OF JU...| brutus| 366| |THE TRAGEDY OF OT...| iago| 362| |THE TRAGEDY OF HA...| ham| 358| |THE THIRD PART OF...| king| 343| |THE TRAGEDY OF OT...| othello| 324| |THE LIFE OF TIMON...| timon| 306| | THE SONNETS| thy| 287| |THE TRAGEDY OF MA...| macbeth| 283| |THE LIFE OF KING ...| king| 282| |THE MERRY WIVES O...| ford| 274| | AS YOU LIKE IT| rosalind| 264| | MEASURE FOR MEASURE| duke| 261| |THE TRAGEDY OF AN...|cleopatra| 260| | KING RICHARD III| king| 258| |THE TRAGEDY OF AN...| caesar| 254| |KING RICHARD THE ...| king| 253| |THE HISTORY OF TR...| troilus| 245| |THE TRAGEDY OF OT...| cassio| 243| +--------------------+---------+-----+ only showing top 20 rows
Obviously the stopwords lexicon wasn't to sensible. Thy is a word that should be removed as well. Research reveals a stopword lexica for middle english might be useful. The python package cltk provides several for older languages... Otherwise there seems to be a small issue with hamlet.
from cltk.stops.words import Stops
stops_obj = Stops(iso_code="enm")
stopwords2= stops_obj.get_stopwords()
df1 = df1.filter(~col("word").isin(stopwords2))
df1.groupBy("title","word").count().sort(desc("count")).show()
+--------------------+---------+-----+ | title| word|count| +--------------------+---------+-----+ |THE TRAGEDY OF AN...| antony| 375| |TWELFTH NIGHT; OR...| sir| 366| |THE TRAGEDY OF JU...| brutus| 366| |THE TRAGEDY OF OT...| iago| 362| |THE TRAGEDY OF HA...| ham| 358| |THE THIRD PART OF...| king| 343| |THE TRAGEDY OF OT...| othello| 324| |THE LIFE OF TIMON...| timon| 306| |THE TRAGEDY OF MA...| macbeth| 283| |THE LIFE OF KING ...| king| 282| |THE MERRY WIVES O...| ford| 274| | AS YOU LIKE IT| rosalind| 264| | MEASURE FOR MEASURE| duke| 261| |THE TRAGEDY OF AN...|cleopatra| 260| | KING RICHARD III| king| 258| |THE TRAGEDY OF AN...| caesar| 254| |KING RICHARD THE ...| king| 253| |THE HISTORY OF TR...| troilus| 245| |THE TRAGEDY OF OT...| cassio| 243| |THE THIRD PART OF...| warwick| 240| +--------------------+---------+-----+ only showing top 20 rows
No more thy!
top_ten=df1.groupBy("title","word").count().sort(desc("count")).limit(10).select("title").rdd.flatMap(list).collect()
data = df1.groupBy("title","word").count() \
.withColumn("row",row_number().over(Window.partitionBy("title").orderBy(col("count").desc()))) \
.filter(col("row")<=10) \
.filter(col("title").isin(top_ten))
data.show()
+--------------------+---------+-----+---+ | title| word|count|row| +--------------------+---------+-----+---+ |THE TRAGEDY OF OT...| iago| 362| 1| |THE TRAGEDY OF OT...| othello| 324| 2| |THE TRAGEDY OF OT...| cassio| 243| 3| |THE TRAGEDY OF OT...|desdemona| 226| 4| |THE TRAGEDY OF OT...| emilia| 139| 5| |THE TRAGEDY OF OT...| roderigo| 105| 6| |THE TRAGEDY OF OT...| lord| 91| 7| |THE TRAGEDY OF OT...| love| 78| 8| |THE TRAGEDY OF OT...| hath| 66| 9| |THE TRAGEDY OF OT...| heaven| 60| 10| |THE TRAGEDY OF AN...| antony| 375| 1| |THE TRAGEDY OF AN...|cleopatra| 260| 2| |THE TRAGEDY OF AN...| caesar| 254| 3| |THE TRAGEDY OF AN...|enobarbus| 154| 4| |THE TRAGEDY OF AN...| charmian| 104| 5| |THE TRAGEDY OF AN...| enter| 101| 6| |THE TRAGEDY OF AN...| sir| 79| 7| |THE TRAGEDY OF AN...| pompey| 75| 8| |THE TRAGEDY OF AN...|messenger| 65| 9| |THE TRAGEDY OF AN...| eros| 64| 10| +--------------------+---------+-----+---+ only showing top 20 rows
graphdata = data.toPandas()
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
fig = px.bar(graphdata, x="word", y="count", color= "title", facet_row="title",
labels = {
'count': '',
'title': '',
},
width=1920, height=800)
fig.update_xaxes(matches=None)
fig.for_each_xaxis(lambda xaxis: xaxis.update(showticklabels=True))
fig.for_each_annotation(lambda a: a.update(text=""))
fig.show()